Discovering Parallel Text from the World Wide Web
Authors
Abstract
A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system that identifies and aligns parallel documents from the World Wide Web. The system uses a Web spider to crawl the Web and fetch potentially parallel multilingual Web documents. Two modules then determine whether a candidate document pair is parallel. First, a filename comparison module checks how closely the filenames resemble each other. Second, a content analysis module measures the semantic similarity of the documents. An experiment conducted on a multilingual Web site shows that the system is effective.
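The abstract does not say how filename resemblance is computed, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes that parallel pages differ mainly by a language marker in the filename (for example `report_en.html` and `report_zh.html`), strips such markers, and scores what remains with Python's standard `difflib.SequenceMatcher`. The marker set `LANG_MARKERS`, the function names, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import posixpath
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical language markers; the paper does not list the ones it uses.
LANG_MARKERS = {"en", "eng", "english", "zh", "chi", "chinese"}


def normalize_filename(url):
    """Return the filename stem of a URL with language markers removed."""
    path = urlparse(url).path
    stem = posixpath.splitext(posixpath.basename(path))[0].lower()
    # Split on common separators and drop tokens that look like language tags.
    tokens = [t for t in re.split(r"[._\-]+", stem) if t and t not in LANG_MARKERS]
    return "-".join(tokens)


def filename_resemblance(url_a, url_b):
    """Score filename similarity in [0.0, 1.0] after normalization."""
    a = normalize_filename(url_a)
    b = normalize_filename(url_b)
    if not a or not b:
        # Nothing left to compare once the language marker is stripped.
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def candidate_pairs(urls_a, urls_b, threshold=0.8):
    """Pair URLs from two language sets whose filenames resemble each other.

    The threshold is an assumed value, not one reported in the paper.
    """
    pairs = []
    for ua in urls_a:
        for ub in urls_b:
            score = filename_resemblance(ua, ub)
            if score >= threshold:
                pairs.append((ua, ub, score))
    return pairs


if __name__ == "__main__":
    english = ["http://example.org/news/report_en.html"]
    chinese = ["http://example.org/news/report_zh.html",
               "http://example.org/news/contact_zh.html"]
    for ua, ub, score in candidate_pairs(english, chinese):
        print(ua, ub, round(score, 2))
```

In a pipeline like the one described, pairs that pass a filename check of this kind would still need to go through content analysis, since similar filenames alone do not guarantee that two pages are translations of each other.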
Similar resources
Discovering Image-Text Associations for Cross-Media Web Information Fusion
The diverse and distributed nature of the information published on the World Wide Web has made it difficult to collate and track information related to specific topics. Whereas most existing work on web information fusion has focused on multiple document summarization, this paper presents a novel approach for discovering associations between images and text segments, which subsequently can be u...
Discovering and Tracking Events From News, Blogs and Microblogs on the Web
Using three data sources (news, blogs, and microblogs), this study proposes a framework for discovering and tracking events embedded in free-form online text. Existing text mining methods are discussed for each of the three sources. Because the three sources offer different perspectives, event analysis, a region-topic model, and rare keywords are proposed for them, respectively. In order to integrate three data source...
EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS
Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...
BITS: A Method for Bilingual Text Search over the Web
Parallel corpora are a valuable resource for machine translation, multilingual text retrieval, language education, and other applications, but for various reasons their availability is very limited at present. Noting that the World Wide Web is a potential source of parallel text, researchers are working to explore the Web in order to obtain a large collection of bitext. This paper pres...
The Web as a Parallel Corpus
Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structur...